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O ' Abstract 

O" 

^ ; The method of tempered transitions was proposed by Neal (1996) for tackhng the difficulties arising 

when using Markov chain Monte Carlo to sample from multimodal distributions. In common with 
methods such as simulated tempering and Metropolis-coupled MCMC, the key idea is to utilise a series 
of successively easier to sample distributions to improve movement around the state space. Tempered 
transitions does this by incorporating moves through these less modal distributions into the MCMC 
proposals. Unfortunately the improved movement between modes comes at a high computational cost 
with a low acceptance rate of expensive proposals. We consider how the algorithm may be tuned to 
, increase the acceptance rates for a given number of temperatures. We find that the commonly assumed 

geometric spacing of temperatures is reasonable in many but not all applications. 
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1 Introduction to tempering ideas in MCMC 

X 

It is well known that standard Markov chain Monte Carlo (MCMC) methods, such as the Metropolis- 
Hastings algorithm or the Gibbs sampler, often have difficulties in moving around their target distribution. 
When a chain mixes poorly in this way, there is a danger that modes have been missed or that modes are 
not represented in their right proportions, both of which may lead to bias in the statistical inference. To 
overcome such mixing problems, various more sophisticated MCMC methods have been devised based on 
a few key ideas. This paper concentrates on one of these key ideas, namely tempering. 

One way to motivate tempering is to think of using importance sampling to estimate some expectation 
E Po [h(X)] with respect to the target distribution po by sampling from some less modal distribution p\. 
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One possibility for generating a less modal distribution than po on the same support is to "flatten" it by 
taking pi(x) oc po(x)P , Vx, with (3 < 1. As /3 — > 0, pi(x) becomes closer to a uniform distribution 
and consequently becomes more amenable to sampling. For (3 close to 1, there is far less benefit as p\ 
may not be that much less modal than pq. Unfortunately as the two distributions become far enough apart 
that the difficulty with modality is overcome, they may also become far enough apart so that many of the 
importance weights will be very close to zero resulting in unstable estimates of E Po [h(X)]. The basis of all 
the tempering methods is the introduction of a series of distributions bridging the gap between po an d Pi- 
The differences between the various approaches is in how these bridging distributions are included. We will 
describe various approaches to incorporating bridging distributions using the common form of tempering 
which involves powering up all or part of the unnormalised target distribution. The inclusion of other types 
of bridging distribution would also be possible, but the literature has generally restricted itself to this form. 
We assume that the target distribution can be written as 

p(x) oc ir(x) exp(-/3 h(x)), (1) 

where h(x) may be known as the "energy" function and the parameter (3q as the target "inverse temperature". 
Since we can write any positive function f(x) in exponential form f(x) = exp(— /3q h(x)) by setting 
h(x) = — 4^ log(/(x)) this class covers a wide range of applications. The tempered distributions are then 
defined by 

Pi(x) oc tt(x) exp(-/3j h(x)), i = 0, 1, . . . ,n, (2) 

where < (3 n < . . . < fi\ < (3q, are the inverse temperatures characterising each distribution. The 
flexibility of potentially only tempering part of the target distribution is quite useful. In Bayesian problems 
it may be that one or other of the prior and likelihood contribute to the mixing problems. 

One of the earlier suggestions for incorporating tempering into MCMC is to run n + 1 Markov chains in 
parallel, each sampling from one of the n + 1 tempered distributions. At each iteration, proposals are made 
to update each chain separately and additionally there is a proposal to swap the x values between chains 
thereby coupling them and giving rise to the name Metropolis-coupled MCMC (Geyer 1991). The state 
space is the enlarged set of (n + 1) values for x and the target distribution is po ® p± ® . . . £g> p n . The idea is 
that large moves made under p n will filter back down to the lowest level po- The normalising constants for 
the tempered distributions are not needed in this method as they appear only in the acceptance probabilities 
for the coupling move where they cancel out. However the tempered distributions do need to be close in 
order that the swaps between them are not too infrequent. This may mean that n will have to be large and 
there are then obvious consequences for storage and computational effort. 

A single chain alternative to Metropolis coupling is simulated tempering (Marinari and Parisi 1992, 
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Geyer and Thompson 1995) which runs a chain on the state space of x augmented by a variable i which takes 
the values i = 0, 1, . . . , n with probabilities determined by a "pseudo-prior". The stationary distribution of 
x\i is Pi(x), and updates are either of x\i or i\x with the latter effectively moving up or down the tempering 
sequence. Although again the normalising constants of the tempering distributions are not needed explicitly, 
in practice to get reasonable acceptance rates for the moves between temperatures, the pseudo-prior needs 
to be roughly proportional to these unknown normalising constants. 

Tempered transitions (Neal 1996) is another single chain method but without the need to guesstimate the 
relative normalising constants of the tempered distributions. It uses a deterministically ordered sweep up and 
down the tempering distributions as a way of generating proposals for the main chain in a way which will be 
described in more detail in the next section. The overwhelming cost of the algorithm is in the construction of 
the proposals and therefore it is imperative that these are tuned carefully to maximise acceptance rates. Neal 
(1996) finds tempered transitions and simulated tempering to be of roughly equal computational cost. He 
also compares tempered transitions, simulated tempering and Metropolis-coupled MCMC on other criteria 
such as storage requirements and the number of tempering levels required concluding that there is no overall 
winner and that the choice of method may be problem and goal specific. 

Closely related methods which aim to make fuller use of the sampling at all temperature levels via im- 
portance sampling can be found in Neal (2001) and more recently in Gramacy, Samworth and King (2010). 
The former has links to tempered transitions, effectively using just the second half of the complex proposal 
mechanism. The latter has stronger connections with simulated tempering and Metropolis coupled MCMC 
where samples from the different temperatures are stored. Other instances of ideas involving populations of 
samples can be found in both the population-based MCMC and the Sequential Monte Carlo literature (see 
Jasra, Stephens and Holmes (2007) for an overview). 

A common question arising across the algorithms involving tempering is the choice of the bridging dis- 
tributions given by Equation ((2]). The general recommendation is to space the /3s geometrically, that is so 
that fii/Pi+i is a constant for all levels i (Neal 1996). Neal formulates this rule by considering sampling 
from a multivariate Gaussian using simulated tempering; the geometric spacing attempts to maximise the 
acceptance rates of swaps between neighbouring chains at all levels. Gelman and Meng (1998) also con- 
sider choices of bridging distribution, although in the context of the closely related question of estimating 
normalising constants where they are trying to minimise the Monte Carlo error of path sampling estimates. 
Other work on rationales for choosing the /3s can be found in Iba (2001) and Lefebvre, Steele and Vandal 
(2010). The former reviews the (largely physics) literature, comparing simulated tempering with exchange 
and ensemble Monte Carlo methods and aims to maximise the swapping rates between the bridging distribu- 
tions using preliminary runs (to satisfy a theoretically derived optimality criterion). The latter is interested 
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in path sampling for estimating normalising constants and the related tuning of the bridging distributions; 
they derive an expression for the symmetrised Kullback-Leibler divergence between pairs of distributions 
and use the minimisation of this as their criterion. 

In this paper we consider tempered transitions with bridging distributions of the form given by Equa- 
tion ©. Of the various tempering schemes, the choice of the {fti} seems to have been least well addressed 
for tempered transitions. Our approach is largely computational and tries to answer the question "For a given 
number n of tempering distributions, how best should they be spaced?". We note that the question of how we 
should choose n, for fixed computational time, is beyond the scope of this article. The approach we take is 
to use a small number of preliminary short runs to assess whether geometric spacing is likely to be adequate 
and, if not, we propose an optimal way of spacing them. In Section [2] we describe tempered transitions in 
detail, building on many of the insights offered in Neal's paper. We provide a theoretical analysis which 
outlines when geometric temperature spacing is optimal and give a motivating example where geometric 
spacing is sub-optimal. We also draw some parallels with some of the other theoretical approaches outlined 
above. In Section [3] we discuss the implementation details of applying our proposed approach to a slowly 
mixing MCMC application. 

2 How to tune tempered transitions? 
2.1 The tempered transitions algorithm 

We begin by describing the algorithmic details of the tempered transitions algorithm and setting up the 
reasoning behind our tuning approach. Suppose the chain is currently in state x, then the algorithm generates 
a proposal x' for the next state using a secondary chain which passes through all the auxiliary distributions 
{pi} first in ascending order of the /3s ("heating-up") and then in descending order ("cooling-down") back 
to the target distribution po- To do this, it uses n pairs of MCMC transition kernels with the i th pair, Tj and 
T[ satisfying detailed balance with respect to pi 

Pi(x) Ti{x, x') = Pi{x') T((x', x) Vx, x',i = 1, . . . ,n. 

Step 1 Set x = x. 

Step 2 Move up and down the tempered distributions using MCMC transitions 

Generate x\ from xq using transition kernel T\. 
Generate X2 from x\ using transition kernel T^. 
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Generate x n from x n -\ using transition kernel T n . 
Generate x' n _ x from x n using transition kernel T' n . 



Generate x[ from x' 2 using transition kernel T' 2 . 
Generate x' from x[ using transition kernel T[. 

Step 3 Accept x' = x' as the next state with probability 

a(x, x'\xo,xi 
otherwise, remain at state x. 



[,..., X n , X n _i, . . . , Xi, Xq) 



min < 1, 



n— 1 i \ 

-q Pi+l(Xi) 

i- 



n-l 

n 

Li=0 



Pi( x 'i) 



(3) 



There is no need for the normalising constants of the tempering distributions as they cancel in the acceptance 
probability Neal (1996) demonstrates that the algorithm satisfies detailed balance with respect to the target 
po. However it is perhaps clear that the proposal is potentially computationally costly and tricky to tune. 

There are two dependent aspects to the tuning. The first is that the {Tj,T/} should have reasonable 
acceptance rates and, at the later stages of the tempering, be able to make large moves in the state space 
(otherwise this expensive proposal makes little change). This is not quite the usual tuning problem for 
MCMC in that each successive level has a different target distribution. During the first half, these distri- 
butions are becoming progressively less modal, while in the second half the reverse is true. Obviously the 
closer the consecutive distributions, ie the closer the consecutive /3s, the less of an effect this will be. 

Subject to the individual {Tj,T/} working well, the second tuning aspect is that the overall acceptance 
rate of the entire tempered transition proposal should be as high as possible (although notice that if all the 
proposed changes in the tempering are rejected, the overall proposal will be accepted since xo = x± = ... = 
x[ = x' and so a high acceptance rate here can be slightly misleading if viewed in isolation). To gain more 
insight into tuning the tempered transition acceptance rate, we follow Neal's lead and rewrite the acceptance 
probability, Equation ©, using the form of tempering defined by Equation (f2]). 



Q?(x, X |xq, X\^ . . . j X^j *^o) 



min < 1 , 



y} exp(-/3 m /i(xj)) 
}} exp(-/3ih(xi)) 



n-l 

n 



J Li=0 



min{l,exp(-(F' - F))} 



(4) 



n-l 



where F 



and F' 



n-l 



i=0 



This expression has an interpretation related to estimating the ratio of normalising constants by thermody- 
namic integration. Let Z{(3) denote the normalising constant of the distribution defined by Equation (f2]), 



where for the moment we assume that j3 takes continuous values in the interval [/3q, P n ]. 



then 



dZ{p) 
dp 




Z(P) dx 



-Z(/3)Ep[h(X)]. 



(5) 



Solving this differential equation for Z(/3) gives the relation 

log(Z(A0) - log(Z(A))) = f ° E p [h{X)) d(3. (6) 

That is, the log of the ratio of the normalising constants at two values of ft can be expressed as the area under 
the curve g(/3) = Ep[h(X)] between them. Recall that the sequences {xq, . . . £ n -i} and {x' n _ l7 . . . x' Q } are 
drawn such that the Xi have target distribution pi, while the x\ have target distribution Figure Q] 

illustrates a slightly idealised realisation of F (left) and F' (right) as the shaded areas constructed as the sum 
of rectangles with widths (/% — Pi+i) and heights h(xt) (left) or /i(a^) (right). Both areas are approximations 
of the integral of g(f3) between p n and Pq. Different realisations of {xq,x\, . . . ,x' } will obviously give 
quite different and usually rather messier pictures, with correspondingly quite varied values of F' — F. 
(In fact, Figure Q] was constructed using the average of several realisations to reduce this variability for 
presentation purposes.) Those realisations of xq, x±, . . . ,Xi, x' where the shaded area on the right (F') is 
smaller than that on the left (F) will be accepted since in that case exp(— (F' — F)) > 1 in Equation ((U). 
Those for which F 1 is slightly larger than F may be accepted, but we will almost certainly reject those for 
which (F' — F) is large. We take this as a motivation for selecting the {/3j} for fixed n. 

2.2 The proposed rationale for choosing {/3j} 

Given the cost of each tempered transition proposal, our motivation is to increase the number of proposals 
accepted. The value of Pq is fixed by Equation dB, and we assume that the other extreme of the /3s is also 
determined, possibly by the fact that it defines a distribution for which direct sampling is possible, certainly 
by the need to move around the state space freely under p n . What remains undetermined are n and the set 



Figure Q] showed the F and F' associated with a realisation {xq,x\, . . . , x' }. If at each stage the tran- 
sitions were able to reach their equilibrium distributions in the one step available, then E[/i(xj)] = g{Pi) 
and E[/i(a^)] = g{Pi + \); Figure |2] shows the corresponding approximations to the integral of g(P) (this is 
equivalent to Neal's Figure 1(a)). Denote this difference between the areas of these two step functions as a 



{Pl,...,Pn-l}. 
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Figure 1: Two approximations to the integral of g(/3). The breakpoints of the rectangles are given by 
Pn < • • • < Pi < A)- The shaded area on the left is F = J27=o(Pi ~ Pi+i)h(xi), while that on the right is 
F' = EtoiPi ~ Pi+i)h(xd- The overlaid curve is g(p) = E p [h{X)]. 



n-l 



n-1 



function of the P values 

s n (p ,...,Pn) = ^(A-^O^itM^l-ECA-A+OEilW] 

i=0 i=0 
n— 1 n— 1 

i=0 i=0 

Some results are well known for g(/3) = E^[/i(X)]. Rewriting Equation (0) as 

p /3 (x)=7r(x)exp(-/3/ l (x)-^(/3)) 

so that K(P) is the log of the normalising constant Z(P) and rearranging Equation ©, 

1 dZ(/3) 



dlog(Z(P)) 

dp 
-K'(P) 



ande/(/3) = | 



h{ x )-jpPp( x ) dx 



(V) 



(8) 



(9) 



Expected h(X) values Expected h(X') values 




0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 



P P 

Figure 2: The shaded area on the left is Yh=c i (A ~~ while that on the right is ^?=o (A ~~ 

/3j + i)<7(/3j + i). The difference between the two areas is S n ((3o, . . . , /3„). 

= J h{x)(-h{x) - K'(P))tt(x) exp(-/3 h{x) - K(/3))dx 
= J(-h(x) 2 + h{x)g{P))pp{x)dx 

= -Var p [h(X)]. (10) 

Therefore g'(f3) < 0, for all (3, showing that g((3) is a decreasing function of /3. It is possible to examine 
g"{P) similarly, showing that the curve may be convex, concave, or a mixture of the two. The main point 
here is that because g{f3) is decreasing, we know that S n (f3o, . . . , j3 n ) > 0. 

We propose the minimisation of 5 n (/3o, . . . , (3 n ) over as our rule for choosing the tempered transi- 
tion parameters. Obviously increasing n immediately reduces S n . However our primary motivation here is 
the most effective choice of the particular . . . , /3 n _i for a fixed number of levels n and fixed values of 
/3o and p n . 

Note that minimising S n = E[F' — F] is not directly equivalent to maximising the expected value of the 
acceptance probability, a = min{l, exp(— (F' — F))}, however 

E[exp(-(F' - F))) = 1-S n + l[ 2 , > 1 (11) 

and so intuitively minimising S n seems a reasonable start. Other possible criteria include, for example, 
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maximising P(F' < F) over the {ft} or, as suggested by one of the referees, examining the variance as 
well as the expectation of F' — F since a high variance could perhaps improve mixing by generating big 
moves more often than a low variance might. We have so far only considered looking at S n . 



2.3 A motivating example 

To motivate the tuning procedure, we study the one-dimensional two-parameter simplified Witch's Hat 
distribution used by Geyer and Thompson (1995). Although this is quite a straightforward example, we 
shall see that it is one for which geometric spacing of the temperatures is not optimal. Geyer and Thompson 
attribute this distribution to Matthews (1993) who introduced it as a problem case for the Gibbs sampler: 

p(x) oc 1 + b\[ x < a ], < x < 1 

where the parameters satisfy < a < 1 and b > 0. This apparently innocuous L-shaped distribution causes 
problems for standard Metropolis-Hastings moves if a is small but b is large as it can be difficult to move 
between the intervals [0, a] and (a, 1]. The distribution can be expressed in a form suitable for tempering 

Pi(x) oc exp(ftlog(l + &l[a;<o])), i = 0, ... ,n 

where /3 = 1. In this case, g(/3) = Ep[— log(l + 6l[ x < a i)] is available analytically as are its derivatives: 

-a(l + bf log(l + b) 



m 



o(l + b)P + (1 - a) 
a(o -l)(l + 6)^(log(l + 6)) 2 
(a(l + bf + 1 - a) 2 



m = - a(a - 1)(1 + b)>S(1 t b))3 (a(l + bf-(l-a)) (12) 



(a(l + bf + (1 - a)) 3 

The second derivative shows that for (3 G [0, 1], the curve g(/3) may be convex (if a 6 [0.5, 1] and b > 0), 
concave (if a € (0, 0.5) and b G (0, 1/a — 2]), or a mix of the two (otherwise). We propose to study one 
distribution for which g((3) is convex (taking a = 0.5 and b = 7.5 x 10 8 ) and one for which it is concave 
(taking a = 10~ 4 and b = 9.5 x 10 3 ). The latter distribution is hard to sample with roughly half the mass 
concentrated in the narrow [0, 9.5 x 10 3 ] peak. The former distribution does not present a sampling problem, 
but we are still interested in the effect of the shape of g(/3) on tempering performance. 

Given g(/3), /3o> Pn an d n > now do we minimise S n over {j3i, . . . , /3 n -i}> Equation ©? The (n — 1) 
partial derivatives are available, 
(99 

— ^ = (o(A-i) - 2g(fii) + g(Pi+i)) + (A-i - Wi + A+iV (A), i = 1, . . . , n - 1, (13) 

however, despite their relatively simple form, no analytic solution is readily available for the minimisation 
problem. As a result, we perform the minimisation numerically using the built-in quasi-Newton optim with 



option L-BFGS-B in R incorporating derivative information and the constraints that (3 n < < fio for 
i = 1, . . . , n — 1 and fixed /3o and (3 n . This function is not guaranteed to converge to a global maximum; 
in our experiments it was insensitive to starting points (we used geometric or uniform spacings) with the 
exception of occasional catastrophic convergence for large n to a non-ordered set of /3s. In all cases we 
encountered, changing from geometric to uniform initial spacings, or vice versa, resolved this problem. 
Figure [3] illustrates the minimal S n and the corresponding {/?,} for the two examples of the Witch's hat 
distribution when n = 4 and f3 n = 1/16, as well as the equivalent S n for a geometric spacing of the {Pi}. 
Unsurprisingly given the different shapes of the two curves, the change in the size of S n achieved by the 
optimal scheme over the geometric one is more significant for the concave g(/3) curve. However, even 
in the convex example it is clear that the values of the /3s themselves, particularly /3i, are quite different 
under geometric spacing and the minimal S n scheme. Table \T\ shows the minimum values of S n and the 
geometric values of S n for n = 2, 4, 8, 16, 32 and 64 and for both pairs of parameters a and b. For each 
n we performed 500000 iterations of tempered transitions where at each level i = 1, . . . ,n we use direct 
sampling (by inversion) to draw from the tempered distribution pj. This direct sampling is only realistically 
possible, at least for values of j3 close to (3q, in a test example such as this, but it does allow us to separate 
the effects of the different ft choices from the effects of slow mixing of the transitions at levels 1 to n. Table 
Q] gives observed average acceptance rates together with the estimated integrated autocorrelation time of the 
tempering calculated with respect to the known theoretical mean. The Witch's Hat example is unusual in 
that only moves between the regions < x < a and a < x < 1 are problematic while all within-region 
moves are always accepted (giving rise to unusually high acceptance rates for a tempering problem). In 
addition, a high acceptance rate in tempered transitions can actually mask a lack of mixing and so integrated 
autocorrelation times are a useful diagnostic. 

For both distributions, decreasing S n by optimising the {/?«}, increases the observed acceptance rate and 
decreases the integrated autocorrelation time. These improvements are most noticeable when comparing the 
geometric and optimal schemes for the hard sampling problem, a = 10 -4 and b = 9.5 x 10 3 , where the 
changes in S n are most dramatic. Concentrating on n = 4 to tie in with Figure [3l for the easier convex 
g(fi), optimising the {/3j} made a small difference to the overall S n albeit with noticeable changes to the 
{/3j} themselves. Here the tuning has made only marginal improvements in acceptance rates and integrated 
autocorrelation times (although as noted earlier, this distribution is not hard to sample and there is little 
scope for improvement anyway). In the harder sampling problem, where g{(5) was concave, the benefits of 
tuning the {/3;} are very clear. In this example, the additional computational cost of tuning comes only from 
the R optimisation stage. The benefits of tuning are greatest when n is small (as n increases, the geometric 
S n anyway decreases to zero). 
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a=0.5, b=7.5 x 1 O 8 



a=0.5, b="7.5 x 1 O 8 




n i i i i i n i i i i r 

O.O 0.4 0.8 O.O 0.4 0.8 



Pgeometric Poptimal 

Figure 3: S n for the Witch's hat distribution when n = 4 using geometric {f5{\ spacing (left) or the optimal 
{/3i} (right) overlaid by g{(3) in black; on the top row, a = 0.5 and b = 7.5 x 10 8 , on the bottom row 
a = lCT 4 and b = 9.5 x 10 3 . 



2.3.1 When is the geometric temperature placement optimal? 

An interesting question raised by this example is under what circumstances will tuning of the {/3j} be likely 
to make efficiency gains over the default geometric spacing for fixed n? Suppose the target distribution is 
the d-dimensional multivariate Gaussian with mean fi and variance S. Then, h(x) = i(x — fi) T T l ~ 1 (x — /i) 
and the tempered distributions are (i-dimensional multivariate Gaussian with mean /j, and variance /jr E. 
More importantly, g(/3) = ^ and g'(/3) = As a result, when the are geometrically spaced, all 
the partial derivatives ^ L = in Equation ([TBI , and so geometric spacing is also the optimal minimum S n 
spacing. In fact we can go further: suppose the set of QMr- are all zero for a general g and for all n when the 
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n — A 
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a = 0.5 
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0.83386 
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b = 7.5 x 10 


a 
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/~\ OA 
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A on 
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A A 1 
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A AO 
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(convex) 
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0.63456 
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6 = 9.5 x 10 


a 


0.55 


0.63 


0.72 


/~\ OA 

0.80 


0.85 


A AA 

0.90 


(concave) 


T 


7.05 


2.36 


1.75 


1.47 


1.33 


1.22 




Geometric 


















3.34158 


2.20779 


1.25229 


0.64996 


0.32786 


0.16428 




a 


0.51 


0.51 


0.55 


0.61 


0.69 


0.78 




T 


591.36 


55.56 


9.13 


3.11 


1.91 


1.54 



Table 1: Results for the Witch's Hat problem under two settings of the parameters a and b and multiple 
choices of n, the number of tempering levels. The minimal sum of squares, observed acceptance rates and 
estimated integrated autocorrelation times are shown for the geometric scheme and for the optimal scheme. 



8- f R \ ^-l n 

{Pi} are geometrically spaced, i.e. when = c n where c n = ( £M , /3 n ^ 0, then 

i-2 ft + c„ ft ■' = '-■■■-1. 
As n — >> oo with fixed /3 and /3 n , c n — > 1 and a repeated application of L'hopital's rule yields the equation 

/V(/3) = -(/3V(/3) + /V(/3)) (15) 

which has general solution g((3) = ^ + for constants K\ and i^2- In other words, geometric spacing 
only minimises S n if the target distribution has this form of g{ft). This is a wider class than just the Gaussian, 
for example the exponential distribution has g{f3) = 4. The result ties in with Figure|3]and TableQ] 
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2.4 An alternative perspective on the optimisation problem 

In this section we provide an intuitive formalisation of what quantity S n in Equation © represents. Since 

Pi(x) = tt(x) exp(—/3ih(x))/Z(Pi), it follows that 

Z(f3 i+ i) pj(x) exp(-f3 i+ ih(x)) 

Z{fii) exp(-Pih(x)) Pi+i(x) 

Pi{x) 



exp(-(/3 i+ i - Pi)h{x))- 



Pi+i(x) 
Taking logarithms, 

Multiplying both sides of the equation by pi + ±(x) and integrating with respect to pi + \{x) leads to 

log (^fer) = (ft " /3 l+ i)E i+1 [h(X)] - KL \p i+1 , Pi ] , (17) 

where KL\pi + \,pi] = J x pi + i(x)log ( P p + (i^ )> i s tne Kullback-Leibler divergence between distributions 
Pi + i and pi. Similarly, it can be shown, by multiplying both sides of Equation (fTBT ) by pi(x) and integrating 
with respect to x, that 

log {^^P) = (ft - A+i)Ei[/ipO] + KL[p u p l+l \. (18) 



Summing both Equations (P"P71) and (PT8l) over i indices leads to 

Z(ft 



1o s(§t§t) = E^-ft+OEmi^-E^^+i,^ 



E(ft -/3 i+ i)Ei[M^)] +Y, KL \Pi>Pi+i]- 



It now follows directly that 



n— 1 

S n (/3 , ■•• ,ftj) = ~ E {^ L bi+l>P*] + ^biiPi+l]} 

Thus our optimisation problem can be recast as one of finding temperatures {/3i, . . . ,/3 n _i} to minimise 
the sum of the symmetrised Kullback-Leibler distances between successive distributions p-i and Pi+\. This 
interpretation ties in with the recent work by Lefebvre, Steele and Vandal (2010) who consider this same 
symmetrised Kullback-Leibler divergence in picking optimal schemes for path sampling. A similar per- 
spective, but in the context of marginal likelihood estimation using the power posterior method of Friel and 
Pettitt (2008), appears in Section 3.2 of that paper and also in Calderhead and Girolami (2009). 
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3 Application of the tuning to a non-toy problem 



3.1 A Bayesian mixture problem 

We now turn to Bayesian mixture modelling, an application where tempered transitions has been advocated 
in the past as a possible solution to sampling problems (see, for example, Celeux, Hum and Robert (2000) 
and Jasra, Holmes and Stephens (2005)). The benchmark for good MCMC mixing here is label-switching: 
In the Bayesian treatment of a k— component mixture model, the likelihood is invariant to the labelling 
of the components. This invariance is inherited by the posterior if, as is quite natural in many cases, the 
priors do not impose identifiability. The logical conclusion of such invariance is that a well-mixing MCMC 
sampler should visit all k\ labellings of the components. Label-switching could be achieved trivially by 
incorporating a move type which permutes the component labels, however this may mask more significant 
difficulties in moving around the state space. Certainly we can have greater confidence in the exploratory 
powers of a sampler which can swap component labels in the course of its other moves. 

We use the much-studied galaxy data set for illustration, see for example Richardson and Green (1997), 
which comprises measurements on the velocities of 82 galaxies (Figure |U). Unlike in that paper though, 
we fix the number of mixture components at three, the smallest number of components with non-negligible 
posterior probability according to Richardson and Green. Using a small number of components makes label 
switching harder as there is no "redundant" component to move freely around the state space exchanging 
identities with the less mobile components needed to explain the data if they pass sufficiently close. 

Denoting the 82 velocity measurements by y = {yi, . . . , ys2}, we follow Richardson and Green (1997) 
in incorporating corresponding latent allocation variables z = {z±, . . . , zg2}- Given Zi = j, yi follows the 
j th of the three component Gaussian distributions of the mixture, 

f(Vi\zi =3,Vj,<Tj) = -?== ex P ( ^ I « = !,•• -,82. 

^a] V 2cJ i / 

Further, conditional independence is assumed for the observations. We specify largely independent standard 
proper priors: 

3 

P(zi = j) = Wj, where V' Wj = 1 

{u>i, W2, ws} ~ Dirichlet(l, 1, 1) 

(ij ~ JV(0, 1000), j = 1,2,3 
cr| ~ InvGam(l,l), j = 1,2,3. 
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Figure 4: The galaxy data used for as illustration of the mixture modelling. 



so that the posterior of interest is 

82 3 3 82 

f{z,{wj,^,(r]} 3 j=1 \y) oc H f(yi\zi,ti Zi ,a 2 z .) x f{{ Wj }) x J[ f( H ) x J[ f(a]) x J[ f{z,\{ Wj }) (19) 

i=l j=l j=l i=l 

We know that if label switching is taking place when sampling from Equation (fl9l that the marginal 
posterior distributions for the sets of parameters of the three Gaussian components should be identical. 
Figure [5] shows the output for the {fij} parameters using 100000 iterations of standard MCMC updates 
including a burn in of 10000 iterations (Gibbs updates for {uij}, {/ij}, {&]} and a uniform Metropolis 
proposal to change the {zi} in turn); it is clear that label switching is not happening. Tempering the whole of 
the posterior defined by Equation (fl9l is problematic as there is no guarantee that the tempered distributions 
will remain proper. Instead, we follow Celeux, Hum and Robert (2000) in tempering only the likelihood 
contribution leaving the priors untempered. This approach generates proper tempered distributions provided 
the priors are proper. In the notation of previous sections, we set (5q = 1 and /3 n = 1/16 while 

3 ( 



1 



N2 



\ Zi=J 



where x = {z, {u)j, fj,j, cr|}^ =1 } and rij = Ya=i I [zi=j] j 3 = 1)2,3. Unlike in the motivating example 
of the previous section, the g(/3) coiTesponding to this form of h{x) is not available analytically and so we 
must now address the question of approximating it before we can optimise the {&}. 
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Figure 5: Histograms and trace plots of the {/ij} chains indicating a lack of mode swapping when j3 = 1. 
3.2 Approximating g(/3) 

The difficulty in estimating g(f3) = Ep[h(X)] and g'((3) = -(E fi [h(X) 2 ] - E p [h(X)] 2 ) for j3 n < (3 < f3 
is that sampling under pp is difficult for j3 close to /3o (hence the need for tempered transitions!). We 
propose instead to estimate g(/3) and g'(f3) using importance sampling. The obvious importance distribution 
to use is pp n since we have already made an assumption that we can sample from this distribution quite 
freely. However it may be a poor choice as an importance distribution for pp when /3 is close to /3q because 
when the importance distribution is quite far from the target, the resulting estimates can be dominated 
by a handful of the samples (Robert and Casella 1999). As a compromise, we importance sample for 
expectations under pp by sampling under pz for some f3 where /3 n < (3 < f3 < Po, in which case the 
unnormalised importance weights are exp(— ((3 — f3)h(x)). We note in passing that a standard result states 
that the importance distribution which minimises the variance of the importance estimate of some function 
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ip(x) is 

f*(x) oc \i/)(x)\ pp{x). 

We turn this statement around to ask for what function tp(x) is ps(x) the optimal importance distribution? 

P§( x ) 

\^{x)\ oc — -- 
p p (x) 

= exp(-0 - ft)h(x)) 

= l-(ft-ft)h(x) + ((ft-ft)h(x)f/2 + ... 

So using p~p as an importance distribution would be optimal if we were trying to estimate exp(— — 
ft)h(x)). It is not optimal for estimating Ep[h(X)] and Ep[h(X) 2 ], however it may be more reasonable for 
this goal when (ft — ft) is small, than if we were, say, trying to estimate E^X] or E^X 2 ]. 

We work with 20 uniformly spaced values of ft in the interval [ft n , fto]. As a compromise between the 
inadequate sampling for large ft and the risk of unreliable importance sampling for large ft — ft, we generate 
relatively small samples at each ft and use these samples to estimate g(ft) and g'(ft) both directly and indi- 
rectly by importance sampling using the next smallest of the 20 chosen values (with obvious modifications 
at the end points). Figure [6] shows the results when using 10000 samples at each ft and discarding the first 
1000 iterations as burn in. We propose to use the average of the two estimates for g(ft) and g'(ft) at each 
point, with visual inspection recommended to check for major discrepancies. In this example, the estimated 
g(ft) curve appears quite far from the geometric-friendly form g(ft) = ^ + K2. 

3.3 Results 

Given the importance sampling estimates of both curve g(ft) and its derivative g'(ft), we can minimise 
S n (fto, . . . , ft n ) using Equation (fT3T ). As before, we use the R optimisation routine optim with linear inter- 
polation used to evaluate g and g' between the 20 ft values. In order to assess the effects of the imperfect 
estimation of g(ft) on the tuning procedure, we replicate the estimation process five times with each replicate 
being used to select {fti}. Figure|7]shows both the variability in estimated g and g' and how the optimised S n 
decreases with n for the five sets of estimates. By letting n become sufficiently large, it would be possible to 
reduce S n below any positive threshold. (An upper bound on the minimum S n is ^{fto— ftn)(g(ftn)— g(fto)), 
achieved either by uniformly spacing ft\, . . . , ft n -\ or by uniformly spacing g(fti), ■ ■ ■ , g{ft n -\).) However 
as the computational cost of the tempering increases linearly in n, the curves show that costs grow quite 
rapidly for relatively small decreases in S n . 

Turning to the tempered transitions themselves, we ran the algorithm for 100000 iterations including a 
burn-in of 10000 iterations. At each tempering stage, the same proposal types were used as in the importance 
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Figure 6: Estimating g(/3) and g'((3): Symbols indicate the /3 at which samples are generated; red circles 
and red lines segments indicate direct sampling; blue triangles and blue line segments indicate importance 
sampling estimates; black lines are linear interpolations. 

sampling. For each of the five sets of importance estimates, we temper using n = 64, 128, 256, 512 and 
compare the optimised /3 results with those of geometric spacing. We know that if switching is taking place, 
then the marginals for each component should be identical. This obviously implies that the three posterior 
expected should be equal, as should the expected weights and the expected variances. We propose to 
monitor the mixing using the usual tool of the integrated autocorrelation times, however in estimating these 
diagnostics we use the averages of each group of parameters over the three chains rather than the usual 
chain-wise average for each parameter. For example, if the labels swap regularly, the individual averages 
of each of the three fii chains will be close to their overall average, and the non-centred autocorrelations 
will not be much different from the standard centred ones. On the other hand, if the labels do not switch as 
was the case with standard MCMC illustrated in Figure [51 the autocorrelations calculated around the overall 
average will be greatly increased and this will be reflected in the modified integrated autocorrelation times. 
Table [2] summarises the results, showing the acceptance rates and the estimated integrated autocorrela- 
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Table 2: Results for the mixture problem using different numbers of tempering levels. Results are shown 
for geometric spacing of the {Pi} and for the tuned spacing from the five replicates of importance sampling. 
*** indicates that the estimates of integrated autocorrelation times did not converge reliably. 
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Figure 7: Left: five replicates of the estimated g({3) in black and g'(/3) in red line segments with {/3} 
indicated by circles. Right: the five corresponding minimum S n against the number of tempering levels n. 

tions times for the three sets of parameters. The first point to note is how large n needs to be in order to 
achieve even low acceptance rates. This is not unexpected; Jasra, Holmes and Stephens (2005) describe 
"huge rejection rates when sampling from the full posterior" using tempered transitions. Although temper- 
ing moves may not be accepted very often, each one can make a large move in the state space and it is 
common practice to intersperse tempering moves with standard MCMC moves for improved local explo- 
ration. The fact that acceptance rates can be so low highlights the importance of any tuning. The worst case 
is geometric spacing when n = 64; here the actual number of acceptances is so low, just 13, that the esti- 
mates of integrated autocorrelation times fail to converge reliably (taken to mean that the estimate exceeded 
a tenth of the total run length). As n increases, acceptance rates improve and integrated autocorrelation 
times decrease for all runs. There is some variability between the five replicates of the tuned spacings of 
the {Pi}, however at all of the n considered, all five outperform geometric spacing by some considerable 
margin in terms of the acceptance rates and consequently the integrated autocorrelation times. 

The cost of tuning the {/%} comprises the cost of the samples required to estimate g(/3) and g'(/3) using 
importance sampling plus the optimisation costs for minimising S n . Here we used 20 relatively short runs, 
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Table 3: Time in user CPU time seconds for the geometrically spaced temperatures and for the tuned tem- 
peratures including a breakdown of the cost of tuning. 

each of only 10000 iterations, for the g{(3) and g'{/3) estimation; this stage is independent of n. The cost of 
the deterministic minimisation of S n in R increases with n but is of the order of a few minutes for n = 512. 
This makes the cost of the tuning procedure a small fraction of the total cost. Full details are given in 
Table [3] showing that the additional cost of the tuning procedure, for this example, varied between 2% and 
4% extra CPU time compared to the untuned procedure. Combining this information with the integrated 
autocorrelation times in Table [2] indicates the tuned procedure gives substantial improvements in mixing 
compared to the geometric temperature placement. 

In this example, very little of the total computational effort was spent in estimating g{fi) and </(/?). 
Although importance sampling cannot be guaranteed to be particularly good for this type of problem, we 
suggest that this is a sensible strategy. The associated risk is either that the importance sampling fails to 
identify a g(/3) curve which is not suitable for geometric spacing or, conversely, that it identifies interesting 
features which are not in fact present. In the former case, a visual inspection of the roughly estimated g((5) 
may suggest that it is not worthwhile implementing any optimisation, reverting to the default geometric, 
and so the wasted CPU time is minimal. (The same argument is also reasonable when importance sampling 
works well for estimating <?(/?).) On the other hand, if the estimated g{(3) looks to be of the form where 
tuning may help, more computational effort could be put into improving the accuracy of its estimation, 
especially if a discrepancy is noted between the values of the curve using direct and indirect sampling. 

4 Discussion 

In this paper we have explored how to tune the expensive tempered transitions algorithm to make best 
use of computational resources. We have shown that the geometric schedule will be optimal if the curve 
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g{f3) = Ep[h(X)] is of a particular form, where the target distribution is p(x) oc tt(x) exp(— /3q h(x)). The 
tuning itself is relatively cheap and examples have demonstrated that it can make a significant difference. 

Although we have not explicitly considered the question of choosing the number of tempering levels, 
we have some purely anecdotal evidence that the tuning procedure may yield useful information regarding 
the minimum number of tempering levels required. In our experience, tempered transitions does not seem 
to perform at all well with a {/3j} for which S n > 1. For example, in our mixture example the geometric S n 
is approximately of the order 2, 1, 0.5 and 0.25 for n = 64, 128, 256, 512 respectively, while for the tuned 
{fii} it is of the corresponding approximate value 1.2, 0.6, 0.3 and 0. 15. It also seems feasible that the tuning 
approach proposed here may also be relevant to some of the other MCMC algorithms which incorporate an 
element of tempering. This is another topic for future research. 
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